import AuthorDetails from '@site/src/components/AuthorDetails';

## [HFClient vLLM](https://github.com/vllm-project/vllm)

### Prerequisites - Launching vLLM Server locally

Refer to the [vLLM Server API](/api/local_language_model_clients/vLLM) for setting up the vLLM server locally.

```bash
#Example vLLM Server Launch

 python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-hf --port 8080
```

This command will start the server and make it accessible at `http://localhost:8080`.


### Setting up the vLLM Client

The constructor initializes the `HFModel` base class to support the handling of prompting models, configuring the client for communicating with the hosted vLLM server to generate requests. This requires the following parameters:

- `model` (_str_): ID of model connected to the vLLM server.
- `port` (_int_): Port for communicating to the vLLM server. 
- `url` (_str_): Base URL of hosted vLLM server. This will often be `"http://localhost"`.
- `**kwargs`: Additional keyword arguments to configure the vLLM client.

Example of the vLLM constructor:

```python
class HFClientVLLM(HFModel):
    def __init__(self, model, port, url="http://localhost", **kwargs):
```

### Under the Hood

#### `_generate(self, prompt, **kwargs) -> dict`

**Parameters:**
- `prompt` (_str_): Prompt to send to model hosted on vLLM server.
- `**kwargs`: Additional keyword arguments for completion request.

**Returns:**
- `dict`: dictionary with `prompt` and list of response `choices`.

Internally, the method handles the specifics of preparing the request prompt and corresponding payload to obtain the response. 

After generation, the method parses the JSON response received from the server and retrieves the output through `json_response["choices"]` and stored as the `completions` list.

Lastly, the method constructs the response dictionary with two keys: the original request `prompt` and `choices`, a list of dictionaries representing generated completions with the key `text` holding the response's generated text.

### Using the vLLM Client

```python
vllm_llama2 = dspy.HFClientVLLM(model="meta-llama/Llama-2-7b-hf", port=8080, url="http://localhost")
```

### Sending Requests via vLLM Client

1) _**Recommended**_ Configure default LM using `dspy.configure`.

This allows you to define programs in DSPy and simply call modules on your input fields, having DSPy internally call the prompt on the configured LM.

```python
dspy.configure(lm=vllm_llama2)

#Example DSPy CoT QA program
qa = dspy.ChainOfThought('question -> answer')

response = qa(question="What is the capital of Paris?") #Prompted to vllm_llama2
print(response.answer)
```

2) Generate responses using the client directly.

```python
response = vllm_llama2._generate(prompt='What is the capital of Paris?')
print(response)
```

***

<AuthorDetails name="Arnav Singhvi"/>